Techniques for Linking Non-Coding and Gene-Coding Deoxyribonucleic Acid Sequences and Applications Thereof

ABSTRACT

Techniques for linking non-coding and gene coding regions of a genome are provided. In one aspect, a method of determining associations between non-coding sequences and gene coding sequences in a genome of an organism comprises the following steps. At least one conserved region is identified from one or more non-coding sequences. Additional instances of the conserved region are located in the untranslated or amino acid coding regions of one or more genes in the organism under consideration, and the conserved region is associated with the one or more biological processes in which these one or more genes participate.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of pending U.S. application Ser. No. 11/367,512, filed Mar. 3, 2006, the disclosure of which is incorporated by reference herein. The present application claims priority to U.S. provisional applications identified as: Ser. No. 60/658,251 (attorney docket no. YOR920050130US1), filed Mar. 3, 2005, and entitled “Overrepresentation of Nucleotides in Human Chromosomes and in the 3′ Untranslated Regions of Human Genes;” and Ser. No. 60/696,213 (attorney docket no. YOR920050350US1), filed Jul. 1, 2005, and entitled “Techniques For Linking Non-Coding And Gene-Coding Deoxyribonucleic Acid Sequences,” the disclosures of which are incorporated by reference herein.

FIELD OF THE INVENTION

The present invention relates to genes and, more particularly, to techniques for linking regions of a genome.

BACKGROUND OF THE INVENTION

It is known that the intergenic and intronic regions comprise most of the genomic sequence of higher organisms. The intergenic and intronic regions are collectively referred to as the “non-coding region” of an organism's genome, as opposed to the “gene-” or “protein-coding region” of the genome. Even though recent work suggested participation of the intergenic and intronic regions in a regulatory role, for the most part, their true function remains elusive. The search for conserved motifs, presumed to be regulatory and control signals, in regions upstream of the 5′ untranslated regions (5′UTRs) of genes, has been the focus of research activities for many years.

More recently, researchers began studying the 3′ untranslated regions (3′UTRs) of genes where they discovered conserved regions and showed them to be functionally significant, in direct analogy to the cis-motifs of promoter regions. Large-scale comparative analyses allowed researchers to also study conservation in the vicinity of genes and elsewhere in the genome with great success. However, these studies were carried out on only a handful of organisms at a time because of the magnitude of the necessary computations.

The analysis of 3′UTRs intensified further after it was discovered that they contain binding sites that are targeted by short interfering ribonucleic acids (RNAs) that induce the post-transcriptional control of the corresponding gene's expression through either messenger RNA (mRNA) degradation or translational inhibition. Accumulating evidence that non-coding RNAs control developmental and physiological processes and that a considerable part of the human genome is transcribed has helped researchers identify “functional” elements in areas of the genome that are not associated with protein-coding regions.

Thus, techniques for efficiently and effectively identifying and associating non-coding regions with gene coding regions of a genome would be desirable.

SUMMARY OF THE INVENTION

Techniques for linking non-coding and gene coding regions of a genome are provided, in accordance with an illustrative embodiment of the present invention.

In a first aspect of the invention, a method of determining associations between non-coding sequences and gene coding sequences in a genome of an organism comprises the following steps. At least one conserved region is identified from a plurality of the non-coding sequences. The at least one conserved region is linked with one or more of the gene coding sequences of the genome. The at least one conserved region is associated with one or more biological processes of the organism.

In a second aspect of the invention, a method of designing one or more sequences of small interfering RNAs that can interact with one or more sites in a given transcript of a given sequence in a given organism and result in the down-regulation of the expression of the protein product encoded by the given transcript comprises the following steps. One or more regions of interest are identified in the sequence of a given transcript. One or more regions are sub-selected from the collection of these regions. One or more derived sequences are generated from the sequence of the one or more sub-selected regions. The one or more derived sequences are used to create one or more instances of the corresponding molecule that the one or more derived sequences represent. The one or more instances of the created molecule are used in an appropriate environment to regulate the expression of the given transcript.

In a third aspect of the invention, a method of engineering a given transcript of a given gene in a given organism in order to regulate its expression comprises the following steps. One or more regions of interest are identified in the sequence of a given transcript. One or more regions are sub-selected from the collection of these regions. The one or more sub-selected regions are used to make one or more modifications to the sequence of the given transcript.

A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating an exemplary methodology for determining associations between non-coding and gene coding sequences in a genome of an organism, according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a probability density function and a cumulative distribution for lengths of patterns discovered in analyzed intergenic and intronic sequences of the human genome, according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a probability density function for a number of 16-mers with a given number of copies in a random input set, according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a preprocessing methodology, according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating pattern specificity and number of appearances, according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a process for determining logically distinct patterns, according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating a probability density function for lengths of pyknons, according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating a number of blocks that are shared by three sets whose union makes up a pyknon collection, according to an embodiment of the present invention;

FIG. 9 is a diagram illustrating a cumulative function for a number of intergenic and intronic copies of pyknons, according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating a cumulative function showing what percentage of affected transcripts contain N or more pyknon instances, according to an embodiment of the present invention;

FIG. 11 is a diagram illustrating combinatorial arrangements of pyknons in 3′UTRs, according to an embodiment of the present invention;

FIG. 12 is a diagram illustrating combinatorial arrangements of pyknons in 5′UTRs, according to an embodiment of the present invention;

FIG. 13 is a diagram illustrating combinatorial arrangements of pyknons in amino acid-coding regions, according to an embodiment of the present invention;

FIG. 14 is a diagram illustrating a probability density function and corresponding cumulative function for a variable representing the fraction of pyknon instances located in repeat-free regions, according to an embodiment of the present invention;

FIG. 15 is a diagram illustrating a partial list of biological processes whose corresponding genes show significant enrichment or depletion in pyknon instances in their 5′UTRs, CRs or 3′UTRs, according to an embodiment of the present invention;

FIG. 16 is a diagram illustrating probability density functions for the distance between starting points of consecutive instances of pyknons, according to an embodiment of the present invention;

FIG. 17 is a diagram illustrating probability density functions for the number of intergenic and intronic copies of variable-length strings derived by counting instances of 3′UTR-conserved pyknons after shifting, according to an embodiment of the present invention;

FIG. 18 is a diagram illustrating the number of intergenic/intronic neighborhoods each of which contains the reverse complement of a pyknon and is predicted to fold into a hairpin-like structure, according to an embodiment of the present invention;

FIG. 19 is a diagram illustrating the number of positions per 10,000 nucleotides that human pyknons cover in other genomes, according to an embodiment of the present invention;

FIG. 20 is a diagram illustrating the number of human pyknons that can be found in the untranslated and coding regions of other genomes, and the number of intergenic/intronic positions that human pyknons cover in other genomes, according to an embodiment of the present invention;

FIG. 21 is a diagram illustrating a first methodology for using pyknons, according to an embodiment of the present invention;

FIG. 22 is a diagram illustrating a second methodology for using pyknons, according to an embodiment of the present invention; and

FIG. 23 is a block diagram of an exemplary hardware implementation of one or more of the methodologies of the present invention, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The sequence listings referred to herein can be found in electronic text format as the file, 1500-675 PatentIn Sequence Listing. ST25, created on Wednesday, Jul. 12, 2006, having a size of 102 Kilobytes, the contents of which are incorporated by reference herein.

FIG. 1 is a diagram illustrating exemplary methodology 100 for determining associations between non-coding and gene coding sequences in a genome of an organism. In step 102, non-coding sequences from the genome of an organism are obtained. Preferably, the non-coding sequences comprise intergenic and/or intronic sequences. As used herein, the term “intergenic” refers generally to any portion of a deoxyribonucleic acid (DNA) strand that is between gene sequences. Further, as used herein, the term “intronic” refers to any portion of a DNA strand that is encompassed by an intron.

According to one exemplary embodiment, the genome comprises the human genome. Further, as will be described below, the starting point of the present techniques may be the genome of a single organism, e.g., the human genome.

In step 104, conserved regions in the intergenic/intronic sequences are identified. As will be described in detail below, these conserved regions may be identified by using pattern discovery techniques to identify patterns existing in the sequences.

In step 106, the identified conserved regions (also referred to as ‘conserved motifs’) of the intergenic/intronic non-coding sequences are linked to gene coding regions of the genome. Specifically, instances of the patterns, described, for example, in conjunction with the description of step 104, above, may be searched for in gene coding regions of the genome. For example, as will be described in detail below, sequences will be identified, e.g., that are at least 16 nucleotide bases in length and appear a minimum of 40 times in the non-coding region of a genome, and which also appear at least one time in the coding region of the genome. These identified sequences (motifs) link the coding and non-coding regions of the genome. As will also be described in detail below, linking the conserved regions and gene coding regions of the genome provides for an association to be made with the biological processes of the organism.

In step 108, as will be described in detail below, the identified motifs that link the non-genic and genic regions of a genome also provide for an association between these motifs and specific biological processes, an association that persists even in genomes other than the human genome.

As used herein, the phrase “conserved region” may also be referred to as a “conserved motif” or a “conserved block” or “an exceptionally-well-conserved block (EWCB)” or a “pyknon.” The term “pyknon” is from the Greek adjective πυκνός/πυκνή/πυκνόν meaning “serried, dense, frequent.”

Accordingly, using an unsupervised pattern discovery method, we processed the human intergenic and intronic regions and catalogued all variable-length patterns with identically conserved copies and multiplicities above what is expected by chance. Among the millions of discovered patterns, we found a subset of 127,998 patterns, termed pyknons, which have one or more additional non-overlapping instances in the untranslated and protein-coding regions of 30,675 transcripts from 20,059 human genes. The pyknons are found to arrange combinatorially in the 5′ untranslated, coding, and 3′ untranslated regions of numerous human genes where they form mosaics. Consecutive instances of pyknons in these regions exhibit a strong bias in their relative placement favoring distances of about 22 nucleotides. We also found that pyknons are enriched in a statistically significant manner in genes involved in specific processes, e.g., cell communication, transcription, regulation of transcription, signaling, transport, etc. For about one third of the pyknons, the intergenic/intronic instances of their reverse complement lie within 382,244 non-overlapping regions, typically 60-80 nucleotides long, which are predicted to form double-stranded, energetically stable, hairpin-shaped RNA secondary structures; additionally, the pyknons subsume about 40% of the known microRNA sequences, thus suggesting a possible link with post-transcriptional gene silencing and RNA interference. Cross-genome comparisons revealed that many of the pyknons are also present in the 3′UTRs of genes from other vertebrates and invertebrates where they are over-represented in similar biological processes as in the human genome. These novel and unexpected findings suggest potential functional connections between the coding and non-coding parts of the human genome.

Thus, in accordance with an illustrative methodology of the invention, we examine whether highly specific patterns exist within a single genome that may act as targets or sources for such putative regulatory activity or as a ‘vocabulary’ for yet undiscovered mechanisms. Our analysis represents a substantial point of departure from previous efforts. First, we carry out all of the analysis on a single genome. Second, we seek patterns in the intergenic and intronic regions of the genome (not the UTRs or protein-coding regions). Third, the pattern instances transcend chromosomal boundaries. And, fourth, we rely on the unsupervised discovery of conserved motifs instead of searching schemes. In particular, we sought to discover identically conserved, variable-length motifs of certain minimum length but unlimited maximum length in human intergenic and intronic regions. We discovered more than 66 million motifs with multiplicities well above what is expected by chance. A sizeable subset of these motifs, referred to as the pyknons, have one or more additional instances in the untranslated and coding regions of almost all known human genes and exhibit properties that suggest the possibility of an extensive link between the non-genic and genic regions of the genome and a connection with post-transcriptional gene silencing (PTGS) and RNA interference (RNAi).

As described, for example, in conjunction with the description of step 104 of FIG. 1, above, according to the techniques described herein, a pattern discovery step may be performed. We used the parallel version of a pattern discovery algorithm described in I. Rigoutsos et al., Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm, 14 BIOINFORMATICS 1, pgs. 55-67 (January 1998) (hereinafter “Rigoutsos”), the disclosure of which is incorporated by reference herein. The pattern discovery (Teiresias) algorithm seeks variable-length motifs that are identically conserved across all of their instances, comprise a minimum of L=16 nucleotides and have a minimum of K=40 copies in the processed input (see below regarding the values of L, K). The algorithm guarantees the reporting of all composition-maximal and length-maximal patterns satisfying these parameters. The input comprised the intergenic and intronic sequences (step 102 of FIG. 1) of the human genome from ENSEMBL Rel. 31 (see Stabenau, A., McVicker, G., Melsopp, C., Proctor, G., Clamp, M. & Birney, E. (2004) Genome Res 14, 929-33) for a total of 6,039,720,050 nucleotides. The input did not include the reverse complement of the 5′ untranslated, amino acid coding or 3′ untranslated regions of any human genes. This exclusion ensures that any discovered patterns are not connected in any way to sequences of known genes, protein motifs or domains. The algorithm ran on a shared-memory architecture with 128 Gigabytes of main memory and 8 processors running at a clock frequency of 1.75 GHz, and generated an initial set P_(init) of 66+ million statistically significant patterns (see below). Most of the patterns in P_(init) were a few tens of nucleotides in length. FIG. 2 shows the probability density function (in black) and cumulative distribution (in light gray) for the lengths of the more than 66 million patterns discovered in the analyzed intergenic and intronic sequences of the human genome. These patterns form the set P_(init). As can be seen from FIG. 2, more than 95% of all discovered patterns are shorter than 100 nucleotides. Note that the primary Y-axis is logarithmic whereas the secondary is linear.

The Teiresias discovery algorithm that we used for this analysis requires the setting of three parameters: L, W and K. The parameter L controls the minimum possible size of the discovered patterns but has no bearing on the patterns' maximum length; the latter is not constrained in any way. The parameter W satisfies the inequality W≥L and controls the ‘degree of conservation’ across the various instances of the reported patterns: smaller (respectively, larger) values of W will tolerate fewer (respectively, more) mismatches across the instances. Since for this analysis, we are interested only in patterns with identically conserved instances, we set W=L (i.e., the patterns contained no “wild cards”). Finally, the parameter K controls the minimum required number of appearances before a pattern can be reported by the algorithm.

For a given choice of L, W and K, the algorithm guarantees the reporting of all patterns that have K or more appearances in the processed input and are such that any L consecutive (but not necessarily contiguous) positions span at most W positions. These patterns are generally overlapping: a given sequence location can simultaneously appear in multiple, distinct, non-redundant patterns. It is also important to stress three properties of the algorithm. First, as stated above, the value L does not impose any constraint on the maximum length of a pattern, which is unbounded. Second, each reported pattern will be maximal in composition, i.e., it cannot be made more specific by specifying the value of a wild-card without decreasing the number of locations where it appears. And, third, each reported pattern will be maximal in length, i.e., it cannot be made longer without decreasing the number of locations where it appears. In this discussion, we use the terms pattern, block and motif interchangeably.
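By way of a non-limiting illustration, the fixed-length special case of such a search can be sketched in Python. The sketch is not the Teiresias algorithm: it assumes W=L (no wild cards), counts only exact L-mers, and makes no attempt to extend them into length-maximal patterns. The function name `frequent_lmers` and its arguments are hypothetical and chosen only for this illustration.

```python
from collections import Counter

def frequent_lmers(sequences, L=16, K=40):
    """Return {lmer: count} for exact L-mers that occur at least K times.

    Simplified stand-in for pattern discovery: fixed length, identically
    conserved instances only, windows with undetermined bases skipped.
    """
    valid = set("ACGT")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - L + 1):
            window = seq[i:i + L]
            if set(window) <= valid:
                counts[window] += 1
    return {lmer: n for lmer, n in counts.items() if n >= K}
```

For a genome-scale input, an in-memory counter of this kind would be impractical; the sketch is meant only to make the roles of L and K concrete.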

Opting for small L values generally permits the identification of shorter conserved motifs that may be present in the processed input, in addition to all longer ones (see the properties above). Generally, for short motifs to be claimed as statistically significant they need to have a large number of copies in the processed input; requiring a lot of copies runs the risk of discarding bona fide motifs. On the other hand, larger values of L will generally permit the identification of statistically significant motifs even if these motifs repeat only a small number of times. This happens at the expense of significant decreases in sensitivity; i.e. bona fide motifs will be missed.

For our analysis, we have selected L=16, a value that strikes a balance between the desirable sensitivity (which favors lower L values) and achievable specificity (which favors higher L values). We stress that the maximality properties of the pattern discovery step ensure that we will be able to report any and all motifs that are 16 nucleotides or longer. And as explained above, we will set W=L.

The last parameter that needs to be set is K, the required number of appearances for a pattern to be reported. K needs to be set to a value that can ensure that the reported patterns could not have been derived from a random database with the same size as the input at hand. In order to determine this value, we used several randomly-shuffled versions of our intergenic and intronic input (of approximately 6 billion bases) and sought in them frequent, fixed-size 16-mers, with all low-complexity 16-mers removed by NSEG (see Wootton, J. C. & Federhen, S. (1993) Computers in Chemistry 17, 149-163). The idea here is that if a randomly-shuffled version of our input set cannot give rise to any 16-mers that appear more than K_(X) times, then it will also be true that no patterns exist in the input set that are longer than 16 nucleotides and have more than K_(X) copies. Several iterations of this process allowed us to establish that K_(X)=23. FIG. 3 shows the probability density function for the number of 16-mers with a given number of copies in the random input set; note that both the X and Y axes are logarithmic. From this, it follows that a randomly-shuffled version of our input set cannot possibly give rise to patterns which are longer than 16 nucleotides and have more than 23 copies: in fact, as a pattern increases in length, the number of times it appears in a given input set can only decrease. We thus opted for the even larger threshold of K=40 for our pattern discovery step.
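The shuffling experiment can be sketched, for illustration only, as follows. The sketch shuffles a sequence, counts fixed-size k-mers, and reports the largest copy number seen over a few trials as an empirical K_(X). It is a toy under stated assumptions: the low-complexity filtering by NSEG is omitted, the whole sequence is held in memory, and the function names, the seeding scheme and the number of trials are hypothetical.

```python
import random
from collections import Counter

def max_kmer_count_after_shuffle(seq, k=16, seed=0):
    """Shuffle the bases of seq and return the largest k-mer copy number."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    shuffled = "".join(bases)
    counts = Counter(shuffled[i:i + k] for i in range(len(shuffled) - k + 1))
    return max(counts.values()) if counts else 0

def estimate_kx(seq, k=16, trials=5):
    """Empirical K_X: the largest copy number over several shuffled versions."""
    return max(max_kmer_count_after_shuffle(seq, k, seed=t) for t in range(trials))
```

Any threshold K chosen above the value returned by `estimate_kx` then plays the role that K=40 plays relative to K_(X)=23 in the analysis above.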

Before we sought to discover patterns in the intergenic and intronic regions of the human genome, we preprocessed the sequences and removed: a) all the regions that corresponded to 5′ untranslated, coding and 3′ untranslated regions of known genes; and, b) all the regions that were the reverse complement of 5′ untranslated, coding and 3′ untranslated regions of known genes. We show this preprocessing step pictorially in FIG. 4. The genomic input before the preprocessing step is shown above the arrow, and the input upon which pattern discovery is run is shown below the arrow.
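A minimal sketch of this preprocessing, offered only as an illustration, is given below. It assumes that genic regions are supplied as zero-based, half-open (start, end) intervals on the forward strand of one chromosome, and that excluding an interval on both strands is equivalent to removing the region and its reverse complement. The function names and the interval representation are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]

def non_genic_segments(chrom_seq, genic_intervals):
    """Return the non-genic segments of both strands.

    Genic intervals (UTRs and coding regions) are cut out of the forward
    strand; the reverse complement of each surviving segment is then added,
    so the same coordinates are excluded on the opposite strand as well.
    """
    segments = []
    cursor = 0
    for start, end in sorted(genic_intervals):
        if start > cursor:
            segments.append(chrom_seq[cursor:start])
        cursor = max(cursor, end)  # tolerate overlapping intervals
    if cursor < len(chrom_seq):
        segments.append(chrom_seq[cursor:])
    return segments + [reverse_complement(s) for s in segments]
```

The returned segments would then serve as the input to the pattern discovery step.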

Under the assumption that all four nucleotides are equiprobable (i.e., p_(A)=p_(T)=p_(C)=p_(G)=¼), independent, and identically distributed, we estimate the probability p of a pattern of length l to be p=4^(−l). We can compute the probability Pr_(k) to observe k instances of a given pattern in a database of size D (D>1) to be Pr_(k)≈(pD)^(k)e^(−pD)/k! (Poisson distribution). The least specific pattern that our method will discover is one that is the shortest possible (i.e., l=L=16) and appears the smallest allowed number of times (i.e., k=K=40): if D=6.0×10⁹ bases (=all chromosomes and both strands), then Pr_(k)=1.95×10⁻⁴³.
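The estimate for the least specific pattern can be checked numerically. The following sketch evaluates the Poisson expression in log space to avoid overflow in the factorial; the helper name `poisson_pmf` and the variable names are illustrative assumptions, while the parameter values are those given above.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events with mean lam (log-space form)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

p = 4.0 ** -16          # probability of one specific 16-mer, uniform model
D = 6.0e9               # database size in bases (both strands)
pr_least_specific = poisson_pmf(40, p * D)
```

Here `p * D` is the expected number of accidental copies (about 1.4), and `pr_least_specific` comes out close to the 1.95×10⁻⁴³ quoted above.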

We now revisit this calculation by taking into account the nucleotides' natural probability of occurrence. Using ENSEMBL Release 31 from May 2005 (based on NCBI Assembly 35 from July 2004) as our database D, we see that the fraction of bases that are undetermined across the 24 human chromosomes ranges from roughly 1.2% to 61.0%, the latter for the Y chromosome. Of course, the following constraints should apply: p_(A)=p_(T) and p_(C)=p_(G). Since the fractions of nucleotides that are undetermined are not equal, the required balance between A/T and C/G is only approximately preserved. Ignoring the unspecified positions and recomputing ratios based on the remaining bases, we find that p_(A)=p_(T)≈3/10 and p_(C)=p_(G)≈2/10.

Let us consider a block of size l and let “match” indicate the match between the i-th character of this block and a character c at a position in a database D of nucleotide sequences. Then it is easy to see that:

Pr (match) = Pr (match  with  c) = Pr (match    (c  is  one  of  A, C, G, T)) = Pr ((match    c = A)    (match    c = C)    (match    c = G)    (match    c = T)) = Pr (matchA)Pr (A) + Pr (matchC)Pr (C) + Pr (matchG)Pr (G) + Pr (matchT)P(T) = Pr (A)Pr (A) + Pr (C)Pr (C) + Pr (G)Pr (G) + Pr (T)P(T) = Pr (A)² + Pr (C)² + Pr (G)² + Pr (T)² = p_(A)² + p_(C)² + p_(G)² + p_(T)² = 0.3² + 0.3² + 0.2² + 0.2² = 0.26

In this analysis, we consider blocks of length l with l≥16. Naturally, these shortest blocks will be associated with the largest probability p of observing a pattern accidentally, since the value p decreases as the value of l increases. The probability that a block of length l=16 will have one instance in the database D is then

p₁=Pr(match)¹⁶=(0.26)¹⁶ or p₁=4.4*10⁻¹⁰.

An alternative way to approach this is to assume that the block of length l is constructed by drawing from the same nucleotide distribution that gives rise to the database D. Then, a block of length l=16 would comprise p_(A)*16≈5 A's, p_(C)*16≈3 C's, p_(G)*16≈3 G's and p_(T)*16≈5 T's. Then, the probability that this block will arise accidentally is

p₂=p_(A)⁵*p_(C)³*p_(G)³*p_(T)⁵=3.8*10⁻¹⁰.

We can compute the probability of finding k accidental instances in a database D that contains 6×10⁹ bases, where each of the instances is independent of all the preceding instances, using the Poisson distribution Pr_(k)≈(pD)^(k)e^(−pD)/k!. The probability Pr_(k) that a 16-mer will appear k times with k=40 is equal to 4.5*10⁻³³ (resp. 2.6*10⁻³⁵) if p₁ (resp. p₂) is used in the calculation.
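The two composition-aware estimates can be reproduced with a short sketch. It assumes the approximate frequencies p_(A)=p_(T)=0.3 and p_(C)=p_(G)=0.2 found above, and it derives the block composition used for p₂ by rounding each frequency times 16 to the nearest integer. All names are illustrative. Because the sketch works with unrounded values of p₁ and p₂, its results agree with the quoted figures in order of magnitude rather than digit for digit.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events with mean lam (log-space form)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

FREQ = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # approximate genome frequencies
BLOCK_LEN = 16
D = 6.0e9   # database size in bases
K = 40      # number of accidental copies considered

# p1: per-position match probability raised to the block length.
pr_match = sum(f * f for f in FREQ.values())
p1 = pr_match ** BLOCK_LEN

# p2: probability of one block whose composition mirrors the frequencies.
composition = {base: round(f * BLOCK_LEN) for base, f in FREQ.items()}
p2 = math.prod(FREQ[base] ** n for base, n in composition.items())

pr_k_p1 = poisson_pmf(K, p1 * D)
pr_k_p2 = poisson_pmf(K, p2 * D)
```

Under either estimate the probability of 40 accidental copies of a 16-mer remains vanishingly small.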

We thus can see that even if we take into account the natural frequency of appearance in the human genome of each of the four nucleotides, the probability that one of our discovered blocks is accidental remains very small even for blocks of size 16 that appear only 40 times.

Alternatively, we can estimate the significance of our patterns using z-scores: for the least specific patterns of length 16 that have exactly 40 identical copies we obtain the remarkably high value of z=32.66. Longer patterns and patterns with more intergenic/intronic copies have even higher z-scores. These analyses confirm in different ways that every one of our discovered patterns is statistically significant and not the result of a random process. These conclusions hold true for the reverse complements of the discovered patterns as well and for the pyknons that are a subset of the discovered patterns P_(init).

It is to be noted that we will use the terms “coding” and “coding region” (abbreviated as CR and CRs) to refer to the translated, amino-acid coding part of exons.
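The formula behind the z-score is not spelled out here. One reading that reproduces the quoted value, offered strictly as an assumption, treats the copy number as Poisson distributed under the uniform model (mean and variance both equal to pD with p=4⁻¹⁶ and D=6.0×10⁹) and computes z=(k−pD)/√(pD). The function name below is hypothetical.

```python
import math

def poisson_z_score(k, lam):
    """z-score of an observed count k when mean and variance both equal lam."""
    return (k - lam) / math.sqrt(lam)

lam_uniform = (4.0 ** -16) * 6.0e9   # expected copies of one 16-mer, uniform model
z = poisson_z_score(40, lam_uniform)
```

Under this assumption `z` evaluates to approximately 32.66 for k=40, and it grows as either the pattern length or the copy number increases.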

We now describe the step of determining which of the discovered patterns have additional instances in the 5′UTRs, CRs or 3′UTRs of known genes. Once the pattern discovery step has produced the set P_(init) of variable length patterns, we processed it to identify ‘logically distinct’ patterns using the following approach. Let there be a recurrent logical unit which appears several times in the intergenic/intronic regions of the human genome; each one of its instances is assumed to have different lengths that reflect varying degrees of conservation. For simplicity, we assume here that different degrees of conservation will result in variable length instances of the pattern. We only seek patterns with identically-conserved instances so this is a correct assumption. For example's sake, we will assume that all variations of the logical unit contain an intact copy of an 18-nucleotide core motif; let TCCCATACCACGGGGATT (SEQ ID NO: 1) represent this core. As the instances of the logical unit become longer and thus more specific, the number of appearances in the input decreases. FIG. 5 shows this example in more detail. Several hypothetical variations of the logical unit are aligned around the common core motif and the number of instances is listed next to each variation.

We reasoned that these patterns should be processed in order of decreasing value of the total number of positions that they span: this number is simply the product of each pattern's length by the number of times it appears in the input. As patterns are examined in turn, some of them are selected and kept whereas others collide with earlier-made selections.

Two collision scenarios are possible and we examine them with the help of FIG. 6. Two blocks, light and darker gray, are shown therein together with a ‘reference set’ of sequences. The light gray block corresponds to a pattern that has already been examined and placed at all its instances. The instances of the darker gray block show the intended placements for the pattern currently under consideration. The blocks collide at two locations (they overlap in the first and second sequence) but the rest of their instances are disjoint. We have two possibilities regarding the handling of collisions. The darker gray block is kept if and only if there is at least one other location in the reference sequence set where it can be placed without generating a collision (e.g. the fifth and sixth sequences in FIG. 6). Alternatively, the darker gray block is kept if and only if it generates no collisions whatsoever with any block that has already been selected and placed. We have opted for the stricter, second choice: if a pattern's instance uses a position that has already been claimed by an earlier-selected pattern, then the pattern under consideration will be discarded and not considered further. Generally, it will be redundant variations of the same pattern that will generate collisions: only one pattern will be used to represent a core motif such as the one shown in FIG. 5.
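The stricter selection rule can be sketched as a greedy pass over the patterns. This is an illustrative sketch only: patterns are given as a mapping from each pattern to its number of intergenic/intronic copies, the reference set is a list of sequences (for example, 3′UTRs), instances are located by exact string search, and patterns with no instance in the reference set are skipped. The function names and data layout are hypothetical.

```python
def find_instances(pattern, sequences):
    """Return (sequence index, start offset) for every exact occurrence."""
    hits = []
    for idx, seq in enumerate(sequences):
        start = seq.find(pattern)
        while start != -1:
            hits.append((idx, start))
            start = seq.find(pattern, start + 1)
    return hits

def select_non_colliding(pattern_counts, reference):
    """Keep a pattern only if none of its instances touches a claimed position.

    Patterns are visited in decreasing order of length times copy number,
    i.e., the total number of positions they span in the original input.
    """
    order = sorted(pattern_counts,
                   key=lambda pat: len(pat) * pattern_counts[pat],
                   reverse=True)
    claimed = set()   # (sequence index, position) pairs already in use
    kept = []
    for pat in order:
        hits = find_instances(pat, reference)
        if not hits:
            continue  # no additional instance in the reference set
        positions = {(idx, start + offset)
                     for idx, start in hits
                     for offset in range(len(pat))}
        if positions & claimed:
            continue  # collision with an earlier selection: discard
        claimed |= positions
        kept.append(pat)
    return kept
```

In this sketch a shorter variant of an already selected motif collides wherever the longer motif has been placed and is therefore dropped, which mirrors the behavior described for redundant variations of the same core.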

The one remaining element is to decide which sequences to use as the reference set. We have chosen to use each of the 5′UTRs, CRs, and 3′UTRs in turn. Sub-selecting among the patterns in P_(init) with the help of each of the 5′ untranslated, coding and 3′ untranslated regions gives rise to the pattern collections P_(5′UTR), P_(CODING) and P_(3′UTR) respectively. The union of these sets, P_(5′UTR) ∪ P_(CODING) ∪ P_(3′UTR), comprises the pyknons, i.e., patterns that were originally discovered in the intergenic and intronic regions of the human genome and which have additional instances in the 5′ untranslated, coding and 3′ untranslated regions of known human genes.

We used the above steps to determine which of the discovered patterns have additional instances in the untranslated and coding regions of genes. After filtering the surviving patterns for low-complexity with NSEG (Wootton, J. C. & Federhen, S. (1993) Computers in Chemistry 17, 149-163), we generated three pattern sets P_(5′UTR), P_(CODING) and P_(3′UTR) that contained 12,267, 54,396 and 67,544 patterns respectively and had additional instances in 5′UTRs, CRs and 3′UTRs. The union P_(5′UTR) ∪ P_(CODING) ∪ P_(3′UTR) contained 127,998 patterns, indicating that the three pattern sets are largely disjoint. We refer to these 127,998 patterns as pyknons.

We now describe some properties of the pyknons. The three pattern sets P_(5′UTR), P_(CODING) and P_(3′UTR) contain 12,267, 54,396 and 67,544 blocks respectively. The union P_(5′UTR) ∪ P_(CODING) ∪ P_(3′UTR) comprises the 127,998 pyknons. In FIG. 7, we show the probability density function for the length of the pyknons; the function is shown separately for each of the three subsets that make up the pyknon collection. Note that the Y-axis is logarithmic.

The patterns in each of the three collections, P_(5′UTR), P_(CODING) and P_(3′UTR), fall into one of two types. “Type-A” patterns are patterns whose reverse complement is also present in the same collection (note that reverse palindromes are included among the type-A patterns). “Type-B” patterns are patterns whose reverse complement is absent from the collection. The breakdown for each of P_(5′UTR), P_(CODING) and P_(3′UTR) is as follows: P_(5′UTR) contains 217 type-A blocks and 11,835 type-B blocks; P_(CODING) contains 1,038 type-A blocks and 52,330 type-B blocks; and P_(3′UTR) contains 2,501 type-A blocks and 62,577 type-B blocks. The clear majority of the blocks in each of the three collections are type-B blocks.
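The type-A/type-B distinction can be made concrete with a short Python sketch. The function names are hypothetical and the sketch assumes patterns over the plain A, C, G, T alphabet.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_by_type(collection):
    """Return (type_a, type_b): patterns whose reverse complement is,
    respectively is not, present in the same collection."""
    members = set(collection)
    type_a = {p for p in members if reverse_complement(p) in members}
    return type_a, members - type_a


type_a, type_b = split_by_type(["ACGT", "AAAC", "GTTT", "CCCA"])
print(sorted(type_a), sorted(type_b))
```

A reverse palindrome such as `ACGT` is its own reverse complement, so it is always counted as type-A, in line with the note above.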

With respect to their content, the three collections are largely disjoint, a characteristic that presumably reflects sequence differences that are inherent to the actual 5′UTRs, CRs and 3′UTRs. FIG. 8 shows pictorially the relationship among the members of the three sets P_(5′UTR), P_(CODING) and P_(3′UTR): note the small cardinalities of the various intersections.

Finally, we comment on the number of intergenic and intronic copies of a pyknon. This number spans a very wide range of values, with the most frequent pyknon having 356,989 copies; the minimum number of copies is, by design, equal to K=40. For about 95% of the pyknons, the intergenic/intronic copies are fewer than 2,000. FIG. 9 shows the cumulative distributions for the number of intergenic and intronic copies of the pyknons; the distribution is again shown separately for each of P_(5′UTR), P_(CODING) and P_(3′UTR) in order to highlight the similarities and differences of the three sets.

The pyknons also exhibit a number of properties that connect the non-genic and genic regions of the human genome, as well as other genomes, in unexpected ways. In particular:

The pyknons have one or more instances within nearly all known genes. The 127,998 pyknons that we originally discovered in the human intergenic and intronic regions have an additional 226,874 non-overlapping copies in the 5′UTRs, CRs or 3′UTRs of 20,059 genes (30,675 transcripts). That is, more than 90% of all human genes contain one or more pyknon instances. The pyknons in P_(5′UTR) cover 3.82% of the 6,947,437 nucleotides in human 5′UTRs; the pyknons in P_(CODING) cover 3.04% of the 50,737,024 nucleotides in human CRs; and the pyknons in P_(3′UTR) cover 7.33% of the 25,597,040 nucleotides in human 3′UTRs. The distribution of pyknons in the various transcripts is not uniform. FIG. 10 shows the cumulative distribution for the number of transcripts with a given number of pyknon instances in them. As can be seen, about 52% of the 30,675 affected transcripts contain four or more pyknon instances; of these, about 2,200 transcripts contain 20 or more pyknon instances.

The pyknons arrange combinatorially in many human 5′UTRs, CRs and 3′UTRs forming mosaics. In those cases where we find many pyknons in one transcript, the pyknons arrange combinatorially and form mosaics. FIG. 1 shows an example of such a combinatorial arrangement in the 3′UTRs of birc4 (an apoptosis inhibitor) and nine other human genes. The 3′UTR of birc4 contains 100 instances of 95 distinct pyknons: of these, 22 are also present in the 3′UTRs of the other nine genes shown. One or more instances of the 95 pyknons from birc4's 3′UTR exist in the 3′UTRs of 2,306 transcripts (data not shown).

We next show two more examples, one involving 5′ untranslated and the other involving coding regions. It is important to stress here that the pyknons are initially discovered in an input that includes neither untranslated/amino-acid-coding sequences nor their reverse complement; thus, pyknon arrangements such as the ones shown in the following two examples represent non-trivial findings from the standpoint of statistical significance. FIG. 12 shows an example of combinatorial rearrangement in the 5′UTRs of ENSG00000196809, a gene of unknown function, and eight more human genes. A total of 63 distinct pyknons have 65 instances in the 5′UTR of ENSG00000196809. Of these 63 pyknons, nine are also shared with the remaining eight genes of the shown group.

FIG. 13 shows an example of combinatorial rearrangement in human coding regions with the help of the amino-acid-coding sequences from 10 distinct genes: 9 pyknons have a total of 124 instances in the coding regions of the shown transcripts, with several of the conserved pyknons appearing twice or more in a given sequence.

Recall that we initially discovered the pyknons in an input that included neither transcribed gene-related sequences nor their reverse complement. Thus, finding so many pyknons with instances in human 5′UTRs, CRs and 3′UTRs is significant, especially in view of the three striking examples of combinatorial rearrangements shown above.

The pyknons account for one sixth of the human intergenic and intronic regions. The intergenic and intronic copies of the pyknons span 692,393,548 positions on the forward and reverse strands. For those pyknons whose reverse complements are not already in the list of 127,998 pyknons, their Watson-strand instances impose constraints on their Crick-strand instances. Considering this and recalculating shows that 898,424,004 positions, i.e., about one sixth of the human intergenic/intronic regions, are covered by pyknons and their reverse complement.

The pyknons are non-redundant. We clustered the pyknons using a scheme based on BLASTN (Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J Mol Biol 215, 403-10). Two pyknons are redundant if they agree on at least X % of their positions. Since our collection includes pyknon pairs whose members are the reverse complement of one another, we had to ensure that the clustering scheme did not over-count: when comparing sequences A and B, we examined for redundancy the pair (A,B) and the pair (reverse-complement-of-A,B). Clustering at X=70, 80 and 90%, we generated 32,621, 44,417 and 89,159 clusters respectively. The high numbers of surviving clusters show that the pyknons are largely distinct.

We next describe the BLASTN-based clustering scheme. Let us assume that we are given a set of N nucleic acid sequences of variable length, and a user-defined threshold X for the maximum permitted remaining pair-wise sequence similarity. Then, we carry out the following steps:

▪ sort the N sequences in order of decreasing length; let S_(i) denote the i-th sequence of the sorted set, so that S_(1) is the longest sequence of the sorted set
▪ CLEANED_UP_SET ← { S_(1) }
▪ for i = 2 through N do
    ▪ use S_(i) as query to run BLAST against the current contents of CLEANED_UP_SET
    ▪ if the top BLAST hit T agrees with S_(i) or with the reverse complement of S_(i) at more than X % of T's positions
        then make S_(i) a member of the cluster represented by T; discard S_(i)
        else CLEANED_UP_SET ← CLEANED_UP_SET U { S_(i) }
  end-for-loop

Upon termination, the set CLEANED_UP_SET contains sequences no pair of which agrees on more than X % of the positions in the shorter of the two sequences.
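The greedy scheme above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually run: the BLASTN search is replaced by a simple ungapped sliding comparison (`best_identity`, a hypothetical stand-in), agreement is measured over the shorter of the two sequences, and all function names are invented for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def best_identity(a, b):
    """Stand-in for BLASTN (assumption): best fraction of matching positions
    when the shorter sequence is slid, ungapped, along the longer one."""
    query, target = (a, b) if len(a) <= len(b) else (b, a)
    best = 0
    for off in range(len(target) - len(query) + 1):
        best = max(best, sum(x == y for x, y in zip(query, target[off:])))
    return best / len(query)


def greedy_cluster(seqs, x_percent):
    """Map each cluster representative to the list of its members."""
    clusters = {}
    for s in sorted(seqs, key=len, reverse=True):  # longest first
        rc = reverse_complement(s)
        top, top_score = None, 0.0
        for rep in clusters:  # find the best-matching representative
            score = max(best_identity(s, rep), best_identity(rc, rep))
            if score > top_score:
                top, top_score = rep, score
        if top is not None and top_score * 100 > x_percent:
            clusters[top].append(s)   # redundant: absorbed by T's cluster
        else:
            clusters[s] = [s]         # distinct: becomes a representative
    return clusters


example = ["ACGTACGTAC", "GTACGTACGT", "TTTTTTTTTT", "ACGTACGTA"]
print(greedy_cluster(example, 90))
```

Here `GTACGTACGT` is the reverse complement of `ACGTACGTAC`, so checking both orientations places it in the same cluster instead of counting it as a separate representative.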

On pyknons and repeat elements. 1,292 pyknons (1.0%) have instances occurring exclusively inside repeat elements as determined with the help of RepeatMasker (Smit, A. & Green, P. RepeatMasker: ftp.genome.washington.edu/RM/RepeatMasker.html). Seventy-nine pyknons have instances exclusively in repeat-free regions. The remaining 126,627 pyknons (98.9% of total) have instances both inside repeat elements and in repeat-free regions. A question that arises here is what fraction, on average, of the total number of copies of pyknons is generated from repeat-free regions. We have computed the probability density and cumulative functions for this fraction, and plot them in FIG. 14. As can be seen, about 60% of the pyknons have more than 90% of their copies inside repeat elements. However, the remaining 40% of the pyknons, which amounts to a little more than 50,000 pyknons, have between 10% and 100% of their instances in regions that are free of repeats.

The pyknons are distinct from the “ultraconserved elements.” 52 pyknons have instances in 46 of the 481 ultraconserved elements (Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W. J., Mattick, J. S. & Haussler, D. (2004) Science 304, 1321-5) and cover 0.67% of the 126,007 positions: uc.73+ contains four pyknons; uc.23+, uc.66+, uc.143+ and uc.414+ each contain two pyknons; the remaining 41 elements contain a single pyknon each.

The pyknons are associated with specific biological processes. For 663 GO terms (Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000) Nat Genet 25, 25-9) describing biological processes at varying levels of detail, we found that the corresponding genes had either a significant enrichment or a significant depletion in pyknon instances. FIG. 15 shows a partial list of GO terms that are enriched or depleted in pyknons.

We determined these associations as follows: each gene was included in a list n times, where n is the number of pyknons found in its 5′ untranslated, coding or 3′ untranslated region, respectively; to avoid over-counting, pyknons with multiple instances in the transcript(s) of a given gene were counted only once. For the sets of sequences belonging to human 5′UTRs, CRs and 3′UTRs, respectively, the binomial distribution was used to estimate the significance of enrichment (or depletion) of pyknons encountered in a group of genes associated with a certain term, compared to the expected frequency of this term in a background set comprising all genes with 5′ untranslated, coding or 3′ untranslated regions respectively.
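As an illustration of the binomial test described above, the following Python sketch computes upper-tail (enrichment) and lower-tail (depletion) probabilities. The function names and the toy numbers are hypothetical, and the direct summation is adequate only for small lists; it is a sketch of the statistic, not the code used to obtain the reported results.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): significance of enrichment."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): significance of depletion."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


# Toy example (hypothetical numbers): a list with n = 50 gene entries, of
# which k = 12 carry a GO term whose background frequency is p = 0.1.
p_enriched = binomial_upper_tail(12, 50, 0.1)
print(p_enriched)
```

With many GO terms tested at once, each p-value would then be compared against a multiple-testing-corrected threshold, such as the Bonferroni-corrected threshold mentioned in the text.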

Two control tests helped ensure the significance of our findings. First, we generated gene lists identical to the ones derived from the real data but which were created by random associations with pyknons: we found that only 1 of the generated 84,780 p-values exceeded our selected significance threshold of a Bonferroni-corrected |log(p-value)| of about 2.3 (data not shown). Second, we examined the relation between GO-process associations and the amount of sequence covered by the pyknons: this test allowed us to rule out the possibility that the derived significant enrichments/depletions were due to variations in sequence length for the genes associated with given cellular processes.

The relative positioning of pyknons in 5′UTRs, CRs and 3′UTRs is strongly biased but consecutive pyknon instances are not correlated. We examined the distances between consecutive pyknons, independently for each of the 5′UTRs, CRs and 3′UTRs. FIG. 16 shows the calculated probability density functions. Given the stringent criteria we used when selecting pyknons, the coverage of each region is not dense, hence the tail-heavy distributions. The three curves have similar shapes, pronounced peaks at abscissas 18 and 22, and an overall preference for distances between 18 and 31 nucleotides.
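The distance statistic can be sketched in Python as below. The input layout (one list of pyknon start positions per region) and the function name are assumptions made for the example; the sketch only shows how an empirical distribution of distances between consecutive starting positions could be tabulated.

```python
from collections import Counter


def consecutive_distance_distribution(starts_per_region):
    """Empirical probability of each distance between the starting positions
    of consecutive pyknon instances, pooled over all regions."""
    counts = Counter()
    for starts in starts_per_region:
        ordered = sorted(starts)
        counts.update(b - a for a, b in zip(ordered, ordered[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: c / total for d, c in sorted(counts.items())}


# Hypothetical start positions in two 3'UTRs.
print(consecutive_distance_distribution([[0, 18, 40], [5, 27]]))
```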

We next examined whether or not the pyknons are fragments of larger conserved regions. Let b denote a pyknon and let us assume that, unbeknownst to us, b is part of a larger-size conserved unit B. Then B will correspond to a larger area than the instance carved out by b, and thus there will be length(B)−length(b)+1 strings in the immediate neighborhood of b whose intergenic and intronic counterparts have as many identically-conserved copies as B. We tested this possibility in 3′UTRs by taking each instance of a pyknon, shifting it by +d (resp. −d), generating a new string and locating the new string's instances in the human intergenic and intronic regions. Had b been part of a larger logical unit, then for some values of d the number of intergenic and intronic copies of the newly formed string would have remained identical to those of b. On the other hand, if b were not part of a larger unit, then the new string would now cross the “natural boundaries” of the underlying presumed logical units and the new string's intergenic/intronic copies would be reduced drastically. Given the strict criteria that we used in identifying pyknons, it is possible that we discarded blocks that are conserved in intergenic/intronic regions and have instances in human coding regions. In this case, a shift of +/−d may end up generating a string that was not included in our set of pyknons but continues to have numerous intergenic/intronic copies. FIG. 17 shows the results obtained for the 3′UTRs for the examined values of d, separately for intergenic (top half) and intronic (bottom half) regions; the curves for d=0 correspond to the pyknons in P_(3′UTR). Note that even for a small shift of d=2 positions, the derived, shifted strings have strikingly fewer copies than the pyknons in P_(3′UTR), and this holds true for both the intergenic and intronic instances. We obtained similar results for negative d's (data not shown).
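The shift test can be illustrated with the following Python sketch. It assumes the intergenic/intronic sequence is available as a single string (`background`) and uses exact string matching; both the function names and the toy sequences are hypothetical.

```python
def count_occurrences(text, word):
    """Number of (possibly overlapping) exact occurrences of word in text."""
    count, i = 0, text.find(word)
    while i != -1:
        count += 1
        i = text.find(word, i + 1)
    return count


def shifted_copy_counts(region, start, length, background, shifts):
    """For each shift d, take the window of the same length starting at
    start + d in the genic region and count its copies in the background."""
    counts = {}
    for d in shifts:
        s = start + d
        if s >= 0 and s + length <= len(region):  # window must fit the region
            counts[d] = count_occurrences(background, region[s:s + length])
    return counts


# Toy data: the pyknon ACGTAC sits at position 2 of a hypothetical 3'UTR.
utr = "TTACGTACCGG"
background = "ACGTACNACGTACNACGTAC"
print(shifted_copy_counts(utr, 2, 6, background, [0, 2, -2]))
```

In the toy input the unshifted window keeps all three background copies while both shifted windows have none, which mirrors the sharp drop described for shifted strings.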

The pyknons are possibly linked to PTGS. The most conspicuous feature of FIG. 16 is the strong preference for distances typically encountered in the context of PTGS. By definition, the 127,998 pyknons have one or more instances in the untranslated and coding regions of human genes: for each pyknon, we generated its reverse complement β, identified all of β's intergenic and intronic instances, and predicted the RNA structure and folding energy of the immediately surrounding neighborhoods using the Vienna package (Hofacker, I. L., Fontana, W., Stadler, P., Bonhoeffer, L. S., Tacker, M. & Schuster, P. (1994) Monatshefte f. Chemie 125, 167-188). We discarded structures whose predicted folding energies were higher than −30 Kcal/mol, and structures (including ones with favorable folding energies) that were predicted to locally self-hybridize, even if the involved positions represented a minuscule fraction of the total length of the regions under consideration. We also discarded structures that contained either a single large bulge or many unmatched bases. Each of the surviving regions was predicted to fold into a hairpin-shaped RNA structure that had a straightforward arm-loop-arm architecture, contained very small bulges if any, and was energetically very stable. The analysis identified 380,084 non-overlapping regions predicted to form hairpin-shaped structures (298,197 in intergenic and 81,887 in intronic sequences). These 380,084 regions contained instances of the reverse complement of 37,421 pyknons (29.23% of total). In terms of length, the clear majority of these regions are between 60 and 80 nucleotides long.

FIG. 18 shows the density of the surviving regions per 10,000 nucleotides. The density is reported for each chromosome and separately for the intergenic and intronic regions. Per unit length, there are more predicted hairpins in intronic than in intergenic regions, but the sheer difference in the magnitude of these regions results in the intergenic regions contributing the bulk of the hairpins. Interestingly, the density of discovered hairpins is not constant across chromosomes: chromosomes 16, 17, 19 and 22, which are the most densely packed in terms of predicted hairpins, are also among the shortest in length. We emphasize that the average pyknon has a length similar to that of a typical microRNA and that there is a straightforward sense-antisense relationship between segments of the 380,084 hairpins and the pyknon instances in human 5′UTRs/CRs/3′UTRs. Also note that the regions containing the 81,887 intronic hairpins will be transcribed: these regions account for 21,727 of the 37,421 pyknons that are linked to hairpins.

If pyknons are indeed connected to PTGS, then two hypotheses arise from FIG. 16: a) in addition to 3′UTRs, gene silencing is likely effected through the 5′UTRs and amino acid coding regions; and b) RNAi products in animals likely form distinct quantized categories based on size and have preferences for lengths of 18, 22, 24, 26, 29, 30 and 31 nucleotides.

The pyknons relate to known microRNAs. We formed the union of the RFAM (Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. & Eddy, S. R. (2003) Nucleic Acids Res 31, 439-41) and pyknon collections, and clustered it with the above-described BLASTN-based scheme, using a threshold of pair-wise remaining sequence similarity of 70%; i.e., we allowed up to six mismatches in 22 nucleotides. When comparing two sequences A and B, we avoided over-counting by examining for redundancy the pairs (A,B) and (reverse-complement-of-A,B). In total, 1,087 known microRNAs clustered with 689 pyknons across 279 of the 32,994 formed clusters.

The pyknons relate to recently discovered 3′UTR motifs. We compared the pyknons in P_(3′UTR) to the 72 8-mer motifs that were recently reported to be conserved in human, mouse, rat and dog 3′UTRs (Xie, X., Lu, J., Kulbokas, E. J., Golub, T. R., Mootha, V., Lindblad-Toh, K., Lander, E. S. & Kellis, M. (2005) Nature 434, 338-45). We say that one of these 8-mers coincides with a pyknon of length l if one of the following conditions holds: the 8-mer agrees with letters l-7 through l of a pyknon (‘type 0’ agreement); the 8-mer agrees with letters l-8 through l-1 (‘type 1’ agreement); or the 8-mer agrees with letters l-9 through l-2 (‘type 2’ agreement). Of the 72 reported conserved 8-mers, 39 were in ‘type 0’ agreement, 10 in ‘type 1’ agreement, and seven in ‘type 2’ agreement with one or more pyknons from P_(3′UTR). Six of the 8-mers did not match at all any of the pyknons in P_(3′UTR). In summary, the pyknons that we have derived by intragenomic analysis overlap with 56 out of the 72 motifs that were discovered through cross-species comparisons.
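The three agreement types can be expressed compactly in Python. In the sketch below the letters of a pyknon of length l are numbered 1 through l, as in the definition above, so ‘type t’ agreement means that the motif matches the window ending t letters before the pyknon's last letter. The function name and the example strings are hypothetical.

```python
def agreement_type(motif, pyknon):
    """Return 0, 1 or 2 if the motif matches the window of the pyknon that
    ends 0, 1 or 2 letters before the pyknon's end; otherwise None."""
    for t in range(3):
        end = len(pyknon) - t          # window covers letters end-7 .. end
        start = end - len(motif)
        if start >= 0 and pyknon[start:end] == motif:
            return t
    return None


motif = "ACGTACGT"  # hypothetical 8-mer
for pyknon in ["TTTTACGTACGT", "TTTTACGTACGTC", "TTTACGTACGTCC", "ACGTACGTCCC"]:
    print(pyknon, agreement_type(motif, pyknon))
```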

Human pyknons are also present in other genomes where they associate with similar biological processes. In FIG. 19, and for each of 7 genomes in turn, we show how many positions in region X of the genome at hand are covered by the human pyknons contained in set P_(X), X={5′UTR,CODING,3′UTR}. We account for length differences across genomes by reporting the number of covered positions per 10,000 nucleotides. FIG. 20 shows how many of the human pyknons contained in set P_(X) can also be found in the region X of the genome under consideration, X={5′UTR,CODING,3′UTR}. FIG. 20 also shows the total number of intergenic and intronic positions covered by those of the human pyknons that are also in other genomes. Notably, the human genome contains more than 600 million nucleotides that are associated with identical copies of pyknons and are absent from the mouse and rat genomes. Interestingly, the human pyknons have many instances in the intergenic and intronic regions of the phylogenetically distant worm and fruit-fly genomes, covering about 1.6 million nucleotides in each.

A set of 6,160 human-genome-derived pyknons are simultaneously present in human 3′UTRs (5,752 genes) and mouse 3′UTRs (4,905 genes), whereas a second set of 388 pyknons are simultaneously present in human 3′UTRs (565 genes), mouse 3′UTRs (673 genes) and fruit-fly 3′UTRs (554 genes). Strikingly, we found these two sets of common pyknons to be significantly over-represented in the same biological processes in these other genomes (i.e. mouse and fruit-fly) as in the human genome, even though the pyknons were initially discovered by processing the human genome in isolation (data not shown). The common processes include regulation of transcription, cell communication, signal transduction etc. Finally, for each of the 388 pyknons in this second set, we manually analyzed neighborhoods about 130 nucleotides long centered on the instances of each pyknon across the human, mouse and fruit-fly 3′UTRs, for a total of more than 4,000 such neighborhoods: notably, we did not find any instance of syntenic conservation across the three genomes.

Accordingly, as explained above, we explored the existence of sequence-based links between coding and non-coding regions of the human genome and identified 127,998 pyknons with a combined 226,874 non-overlapping instances in the 5′UTRs, CRs or 3′UTRs of 30,675 transcripts from 20,059 human genes. In transcripts that contained multiple pyknon instances, we found that the pyknons arrange themselves combinatorially, forming mosaics. Statistical analysis revealed that the untranslated and/or coding regions of genes associated with specific biological processes are significantly enriched/depleted in pyknons.

We also found that the pyknon placement in 5′UTRs, CRs and 3′UTRs is strongly biased: the starting positions of consecutive pyknons show a clear preference for distances between 18 and 31 nucleotides. Importantly, we found an apparent lack of correlation between consecutive pyknon instances in these regions. The observed bias in the relative placement of the pyknons is conspicuously reminiscent of lengths that are associated with small RNA molecules that induce PTGS, suggesting the hypothesis that the pyknons' instances correspond to binding sites for microRNAs. Analysis of the regions immediately surrounding the intergenic and intronic instances of the reverse complement of the 127,998 discovered pyknons revealed that about 30% of the pyknons have instances within about 380,000 distinct, non-overlapping regions between 60 and 80 nucleotides in length that are predicted to fold into hairpin-shaped RNA secondary structures with folding energies ≤ −30 Kcal/mol. Many of these predicted hairpin-shaped structures are located inside known introns and thus will be transcribed. Our analysis also suggests that PTGS may be effected through the genes' 5′UTRs and amino acid coding regions, in addition to their 3′UTRs. Another resulting hypothesis is that RNAi products in animals likely fall into distinct categories that are quantized in terms of size and have preferences for lengths of 18, 22, 24, 26, 29, 30 and 31 nucleotides. Notably, through sequence-based analysis, we showed that about 40% of the known microRNAs are similar to 689 pyknons, and that the pyknons subsume 56 of the 72 recently reported 3′UTR motifs, lending further support to the possibility of a connection between the pyknons and RNAi/PTGS.

The intergenic/intronic copies of the 127,998 pyknons constrain approximately 900 million nucleotides of the human genome. Instances of human pyknons can also be found in other genomes, namely C. elegans, D. melanogaster, G. gallus, M. musculus, R. norvegicus and C. familiaris. The number of human pyknons that can be located in the 5′UTRs, CRs and 3′UTRs of other genomes decreases with phylogenetic distance. Strikingly, the pyknons that we found inside mouse and fruit-fly 3′UTRs were over-represented in the same biological processes as in the human genome. On a related note, more than 600 million bases, which correspond to identically conserved intergenic and intronic copies of human pyknons, are not present in the mouse and rat genomes.

The fact that some of the intergenic/intronic copies of pyknons originate in repeat elements may lead one to assume that our analysis has merely ‘rediscovered’ such elements. However, as mentioned above, more than 50,000 of the pyknons have many of their instances in repeat-free regions. Moreover, the typical length of a pyknon is substantially smaller than, e.g., that of an ALU. It was recently reported that genes can achieve evolutionary novelty through the ‘careful’ incorporation of ALUs in their coding regions (Iwashita, S., Osada, N., Itoh, T., Sezaki, M., Oshima, K., Hashimoto, E., Kitagawa-Arita, Y., Takahashi, I., Masui, T., Hashimoto, K. & Makalowski, W. (2003) Mol Biol Evol 20, 1556-63; and Lev-Maor, G., Sorek, R., Shomron, N. & Ast, G. (2003) Science 300, 1288-91). Also, the “pack-mule” paradigm revealed that entire genes, large fragments from a single gene, or fragments from multiple genes can be ‘hijacked’ by transposable elements (Jiang, N., Bao, Z., Zhang, X., Eddy, S. R. & Wessler, S. R. (2004) Nature 431, 569-73). However, ‘fortuitous coincidence’ is generally considered the prevailing mechanism by which such potential is unleashed. In contrast, the combinatorial arrangement of the pyknons within the untranslated and coding regions of genes, together with the large number of instances in these regions and the association of pyknons with specific biological processes, suggests that their placement is not accidental and likely serves a specific purpose. Our findings do not rule out a link with transposable elements. On the contrary, the findings seem to support a dynamic view of a genome (Jorgensen, R. A. (2004) Cold Spring Harb Symp Quant Biol 69, 349-54) that has learned to respond, and likely continues to respond, to environmental challenges or “stress” in a controlled, organized manner.

Taken together, the results suggest the existence of an extensive link between the non-coding and gene-coding parts in animal genomes. It is conceivable that this link could be the result of integration into the genome of dsRNA-breakdown products. Since many genes are known to give rise to antisense transcripts, it is possible that these genes were at some point subjected to RNAi-mediated dsRNA breakdown which in turn gave rise to products about 20 nucleotides in length. The latter, through repeated integration, could have eventually given rise to the numerous intergenic and intronic copies of the pyknons that we have identified. However, this explanation would have to be reconciled with four of our findings. First, the pyknons have identically conserved copies in non-genic regions. Second, pyknons appear to favor a specific size and, in genic regions, a specific relative placement. Third, slight modification of the 3′UTR instances of the pyknons by either prepending or appending immediately neighboring positions results in new strings whose intergenic and intronic copies are markedly decreased. And fourth, we can discover human pyknons in other organisms such as the mouse and the fruit-fly where they exhibit a persistent enrichment within specific processes yet are not the result of syntenic conservation. It may well be that we are seeing traces of an organized, coordinated activity that involves nearly all known genes. The existence of a pyknon-based regulatory layer that is massive in scope and extent, originates in the non-coding part of the genome, operates through the genes' untranslated and coding regions, and is likely linked to PTGS, is a tantalizing possibility. Moreover, the observed disparity in the number of intergenic/intronic positions covered by human pyknons in the human and the phylogenetically close mouse/rat genomes suggests that pyknons and thus the presumed regulatory layer may be organism-specific to some degree (“pyknome”). Addressing such questions might eventually help explain the apparent lack of correlation between the number of amino-acid coding genes in an organism and the organism's apparent complexity.

In the above description, and in order to identify motifs that are present in both non-genic and genic regions, we proceeded by first carrying out pattern discovery in the intergenic and intronic regions of the human genome. Once those patterns were determined, we identified additional instances for them in the genic regions of the genome and in particular in the 5′ untranslated, amino acid coding and 3′ untranslated regions of the genes. In other words, the computational analysis flowed from the non-genic to the genic regions. But there is nothing that inherently prevents us from carrying out the computation in the other direction, i.e., from the genic to the non-genic regions, although there is potential for a loss in sensitivity that might result in the identification of smaller sets of motifs linking non-genic with genic regions. One could carry out the genic/non-genic analysis in a number of ways. For example, one could use a pattern discovery method to process the full collection of 5′ untranslated, amino acid coding and 3′ untranslated regions (with the regions processed separately or together), identifying recurrent motifs contained therein, and finally establishing links with the non-genic regions of the genome by locating the intergenic and intronic copies for these motifs.

Instead of working with the full length sequences of the genes' untranslated and coding regions, an alternative method would be to delineate areas of interest in these regions (effectively subselecting), analyze those areas to derive motifs, and finally locate additional instances of these motifs in the non-genic parts of the genome. Such areas of interest could, for example, be known or putative microRNA binding sites. Alternatively, the areas of interest could be what, in our work on the problem of RNA interference, we refer to as “target islands.” That work is described in detail in the U.S. patent application identified as Ser. No. 11/351,821, filed on Feb. 10, 2006, and entitled “System and Method for Identification of MicroRNA Target Sites and Corresponding Targeting MicroRNA Sequences,” the disclosure of which is incorporated herein.

In summary, our approach for finding microRNA target sites is known as rna22 and proceeds as follows: it discovers statistically significant patterns that are contained in the sequences of known microRNAs, generates their reverse complement, identifies all the instances of these reverse-complement patterns in a region of interest (namely one of 5′UTRs, CRs or 3′UTRs) and finally reports groups of consecutive locations from the region of interest as long as they are ‘hit’ a minimum number of times by these patterns. Generally, the groups of consecutive locations that rna22 reports will be variable in length and may correspond to one or more binding sites: consequently, and so as to not lose generality, we have been referring to them as “target islands.”
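The final step of the summary above, reporting runs of consecutive locations that are hit a minimum number of times, can be sketched in Python as follows. This is not the rna22 implementation: the representation of hits as (start, length) pairs, the function name and the half-open island coordinates are assumptions made for the illustration.

```python
def target_islands(region_length, hits, min_hits):
    """Return maximal runs [start, end) of consecutive positions that are
    covered by at least min_hits pattern instances."""
    coverage = [0] * region_length
    for start, length in hits:  # each hit is a (start, length) pair
        for pos in range(max(start, 0), min(start + length, region_length)):
            coverage[pos] += 1
    islands, run_start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_hits and run_start is None:
            run_start = pos                      # a new island opens
        elif depth < min_hits and run_start is not None:
            islands.append((run_start, pos))     # the island closes
            run_start = None
    if run_start is not None:
        islands.append((run_start, region_length))
    return islands


hits = [(2, 5), (4, 5), (5, 4), (15, 3)]  # hypothetical pattern instances
print(target_islands(20, hits, 2))
```

Because islands are defined by coverage depth rather than by individual pattern boundaries, their lengths vary with how the instances happen to overlap, consistent with the variable-length islands described above.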

Let us assume that target islands are available for the region of interest. One could proceed by doing an all-against-all comparison of the target islands to form clusters. Initially, each target island is in its own cluster. Any two target islands that end up in the same cluster have the property that their corresponding sequences share a substantial portion of their extent, say a minimum of N locations. There is always the possibility that the thresholds used in the various stages of the process are too stringent, thus causing the method to miss some target islands that could otherwise have become members of some cluster c. In order to account for this, one could enhance the cluster-forming process as follows. Using the Clustal-W multiple alignment algorithm (Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G. & Thompson, J. D. (2003) Nucleic Acids Res 31, 3497-500), we could align the sequences in cluster c and extract the core region of the alignment, then use it to search the sequences of interest for instances of the core region that were skipped because of the employed thresholds. If a given cluster contains more than one core region, then it can be replaced by as many new clusters as the number of its distinct core regions. For each of the formed clusters whose core region, as obtained from the Clustal-W alignment of its members, is at least N nucleotides in length, we report the region as a (genic) motif.

Optionally, one can discard core regions that exhibit low complexity using the NSEG algorithm (Wootton, J. C. & Federhen, S. (1993) Computers in Chemistry 17, 149-163). Instances of these motifs are then sought in the intergenic/intronic regions of the corresponding genome to establish links between coding and non-coding parts of the genome. Finally, it is clear that instead of clustering the target islands to determine motifs, one could simply use a pattern discovery approach and subselect among the reported patterns to keep only those that, for example, satisfy a minimum length requirement, or some other property.

Given the above description, a few points should be noted. First, it is clear that the method which we have described and the ensuing analysis are not specific to the human genome; in fact, they can be carried out separately in other eukaryotic genomes such as chimpanzee, mouse, rat, dog, chicken, fruit-fly, worm, etc. It is expected that the resulting pyknomes will have non-zero intersections with one another but will likely also contain organism-specific pyknons. Whether generated from the human or some other genome, the pyknons are statistically significant and link the non-genic and genic regions of the genome at hand. The links that are instantiated by the pyknons are ‘natural’ in that they involve large numbers of sequences that occur naturally in the genome at hand. Consequently, the pyknons would form natural candidates for a number of processes that to date have been carried out using schemes that make use of local information alone and do not take into account long-range conservations of the kind that we presented in our discussion.

One such application would be the design of small interfering RNAs (siRNAs) to regulate the gene expression of one specific gene. Some of our pyknons have the property of being shared by two or more genes, which allows the design of siRNAs that can interfere with a cluster of genes at once. As illustrated in the flow diagram of FIG. 21, a method for designing one or more sequences of siRNAs that can interact with one or more sites in a given transcript of a given sequence in a given organism and result in the down-regulation of the expression of the protein product encoded by the given transcript can comprise the following steps. One or more regions of interest are identified in the sequence of a given transcript (step 2102). One or more regions are sub-selected from the collection of these regions (step 2104). One or more derived sequences are generated from the sequence of the one or more sub-selected regions (step 2106). The one or more derived sequences are used to create one or more instances of the corresponding molecule that the one or more derived sequences represent (step 2108). The one or more instances of the created molecule are used in an appropriate environment to regulate the expression of the given transcript (step 2110).
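
The computational steps of FIG. 21 (steps 2102 through 2106) can be sketched as follows; steps 2108 and 2110 are laboratory steps and are not modeled. The sketch assumes that regions of interest are exact instances of a supplied list of motifs, that sub-selection uses only length and location, and that sequences are written in the DNA alphabet (an actual siRNA would use U in place of T). All names and example sequences are hypothetical.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class Region:
    start: int      # 0-based offset in the transcript
    sequence: str   # sequence of the region of interest


def identify_regions(transcript, motifs):
    """Step 2102: locate every instance of each motif in the transcript."""
    regions = []
    for motif in motifs:
        pos = transcript.find(motif)
        while pos != -1:
            regions.append(Region(pos, motif))
            pos = transcript.find(motif, pos + 1)
    return sorted(regions, key=lambda r: (r.start, r.sequence))


def sub_select(regions, min_length, min_start=0):
    """Step 2104: keep regions that satisfy length and location criteria."""
    return [r for r in regions
            if len(r.sequence) >= min_length and r.start >= min_start]


def derive_sequences(regions):
    """Step 2106: derive the reverse complement of each sub-selected region."""
    return [r.sequence.translate(COMPLEMENT)[::-1] for r in regions]


# Hypothetical transcript and motifs.
transcript = "GGACGTTACAAACCAGGTTTT"
motifs = ["ACGTTACA", "CCAGGT", "TTT"]

regions = identify_regions(transcript, motifs)
selected = sub_select(regions, min_length=6, min_start=10)
print(selected)                    # [Region(start=12, sequence='CCAGGT')]
print(derive_sequences(selected))  # ['ACCTGG']
```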

Further, the method of designing one or more siRNAs may use a region of interest in the collection of regions of interest identified to be an instance of a motif that has one or more copies in the intergenic and intronic regions of the genome of interest, and one or more copies in the untranslated and amino acid coding regions of one or more genes in the genome of interest, each such region of interest being computed using the method and system for finding pyknons described above.

The method may use a region of interest in the collection of regions of interest identified using a method that is based on pattern discovery, for example, the method described in the above-referenced U.S. patent application identified as Ser. No. 11/351,821. A region of interest in the collection of regions of interest can also be identified to be a target island that is computed using the method also described in the above-referenced U.S. patent application identified as Ser. No. 11/351,821.

The method of designing one or more siRNAs may also use a region of interest, for example, located in the 5′ untranslated region of the given transcript, located in the amino acid coding region of the given transcript, or located in the 3′ untranslated region of the given transcript.

As detailed above, the method of designing one or more siRNAs can be used where the genome of interest is a eukaryotic genome, and wherein the eukaryotic genome is, for example, the human genome, the mouse genome, the rat genome, the dog genome, the fruit fly genome, or the worm genome.

Also, the method of designing one or more siRNAs may use a region of interest that is sub-selected based on one or more of its attributes. These attributes may include, for example, the region's length and the region's location in the transcript.

The method of designing one or more siRNAs can also use a derived sequence that is, for example, the reverse complement of the sequence of the one or more sub-selected regions, or a near-reverse complement of the sequence of the one or more sub-selected regions, i.e., a reverse complement that contains mismatches at one or more locations.
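
The distinction between a reverse complement and a near-reverse complement can be illustrated with a short sketch. It assumes sequences in the DNA alphabet and that the mismatch locations and substituted bases are chosen by the caller; how such locations would be chosen is not specified above. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Exact reverse complement of a DNA-alphabet sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def near_reverse_complement(seq: str, mismatches: dict) -> str:
    """Reverse complement with substitutions at the given positions.

    `mismatches` maps a 0-based position in the derived sequence to the
    base placed there; each substitution must differ from the exact base.
    """
    derived = list(reverse_complement(seq))
    for pos, base in mismatches.items():
        if derived[pos] == base:
            raise ValueError(f"position {pos} already holds {base}; not a mismatch")
        derived[pos] = base
    return "".join(derived)


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


exact = reverse_complement("CCAGGT")
near = near_reverse_complement("CCAGGT", {2: "A"})
print(exact, near, count_mismatches(exact, near))  # ACCTGG ACATGG 1
```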

The method of designing one or more siRNAs can be used such that the one or more copies of the molecule can be built using any of a set of biochemical processes.

Another application would involve the rational use of pyknons to appropriately engineer a transcript of interest in order to control its expression (either up-regulate or down-regulate) in a specific tissue or for a specific cellular process. For example, one could remove one or more of the pyknons existing in the transcript of interest, leading to an up-regulation of the transcript. Alternatively, one could down-regulate the transcript of interest by adding more instances of existing pyknons and relying on the naturally occurring agent that targets this pyknon to induce down-regulation. Or one could add the sequence of a pyknon that is not among those contained in the transcript and selectively control the transcript's expression by adding or removing appropriately generated instances of the reverse complement of the pyknon.

As illustrated in the flow diagram of FIG. 22, a method for engineering a given transcript of a given gene in a given organism in order to regulate its expression may comprise the following steps. One or more regions of interest are identified in the sequence of a given transcript (step 2202). One or more regions are sub-selected from the collection of these regions (step 2204). The one or more sub-selected regions are used to make one or more modifications to the sequence of the given transcript (step 2206).

Further, the method of engineering a given transcript to regulate gene expression can comprise many of the same steps as mentioned above in the method for designing one or more siRNAs. For example, the method of engineering a given transcript to regulate gene expression may use a region of interest in the collection of regions of interest identified to be an instance of a motif that has one or more copies in the intergenic and intronic regions of the genome of interest, and one or more copies in the untranslated and amino acid coding regions of one or more genes in the genome of interest. The motif can be computed, for example, using the pyknon discovery method and system described above.

Also, as above, the method of engineering a given transcript to regulate gene expression may use a region of interest in the collection of regions of interest computed using a method that is based on pattern discovery, for example, the method described in the above-referenced U.S. patent application identified as Ser. No. 11/351,821.

The present method may also use a region of interest, for example, located in the 5′ untranslated region of the given transcript, located in the amino acid coding region of the given transcript, or located in the 3′ untranslated region of the given transcript.

Also, similar to the above methodology, the method of engineering a given transcript to regulate gene expression may use a region of interest that is sub-selected based on one or more of its attributes including, for example, the region's length and the region's location in the transcript. Additional attributes may include the association of the region with a given biological process, the region's association with a given tissue, and the region's association with a given cellular compartment.

Further, the method of engineering a given transcript to regulate gene expression can include a modification that, for example, comprises an extension of the sequence of the given transcript, or a shortening of the sequence of the given transcript. The extension can, for example, comprise one or more instances of a region of interest, and the shortening can, for example, comprise one or more instances of a region of interest.
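
The sequence modifications of step 2206 can be sketched at the string level as follows. The sketch assumes that a shortening deletes exact, non-overlapping instances of a region of interest and that an extension appends copies at the 3′ end of the transcript; the placement of added instances is an assumption, since the description above does not fix it. The function names and sequences are hypothetical.

```python
def count_instances(transcript: str, region: str) -> int:
    """Number of non-overlapping exact instances of the region."""
    return transcript.count(region)


def shorten(transcript: str, region: str) -> str:
    """Shortening: delete every non-overlapping instance of the region."""
    return transcript.replace(region, "")


def extend(transcript: str, region: str, copies: int = 1) -> str:
    """Extension: append copies of the region at the 3' end (assumed placement)."""
    return transcript + region * copies


transcript = "AAACCAGGTTTTCCAGGTGG"
pyknon = "CCAGGT"  # made-up sequence standing in for a pyknon

print(count_instances(transcript, pyknon))  # 2
shortened = shorten(transcript, pyknon)
print(shortened)                            # AAATTTGG
print(extend(shortened, pyknon, copies=2))  # AAATTTGGCCAGGTCCAGGT
```

Note that deleting instances can, in general, create a new instance at the junction of the flanking sequences, so a practical implementation would re-scan the modified transcript.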

Another application of pyknons, for example, would be the measuring of the impact that one or more pyknons can have on a gene's regulation “by proxy.” This would entail the engineering of an assay that involves a reporter gene (for example, luciferase) and instances of the one or more pyknons placed downstream from the region that codes for the reporter's amino acid sequence. Then, one can measure the impact on the expression of the reporter gene by using various combinations of appropriately generated instances of the reverse complement of these pyknons. The observations made in the context of the reporter assay can then be carried over to the gene that is studied. Additional applications are also possible if one assumes that for the organism that is being studied the sequences of the corresponding pyknons are available.

FIG. 23 is a block diagram of an exemplary hardware implementation of one or more of the methodologies of the present invention. That is, apparatus 2300 may implement one or more of the steps/components described above in the context of FIGS. 1-22. Apparatus 2300 comprises a computer system 2310 that interacts with media 2350. Computer system 2310 comprises a processor 2320, a network interface 2325, a memory 2330, a media interface 2335 and an optional display 2340. Network interface 2325 allows computer system 2310 to connect to a network, while media interface 2335 allows computer system 2310 to interact with media 2350, such as a Digital Versatile Disk (DVD) or a hard drive.

As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer-readable medium having computer-readable code means embodied thereon. The computer-readable program code means is operable, in conjunction with a computer system such as computer system 2310, to carry out all or some of the steps to perform one or more of the methods or create the apparatus discussed herein. For example, the computer-readable code is configured to implement a method of determining associations between non-coding sequences and gene coding sequences in a genome of an organism, by the steps of: identifying at least one conserved region from a plurality of the non-coding sequences; and linking the at least one conserved region with one or more of the gene coding sequences of the genome to associate the at least one conserved region with one or more biological processes of the organism. The computer-readable medium may be a recordable medium (e.g., floppy disks, hard drive, optical disks such as a DVD, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic medium or height variations on the surface of a compact disk.

Memory 2330 configures the processor 2320 to implement the methods, steps, and functions disclosed herein. The memory 2330 could be distributed or local and the processor 2320 could be distributed or singular. The memory 2330 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by processor 2320. With this definition, information on a network, accessible through network interface 2325, is still within memory 2330 because the processor 2320 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor 2320 generally contains its own addressable memory space. It should also be noted that some or all of computer system 2310 can be incorporated into an application-specific or general-use integrated circuit.

Optional video display 2340 is any type of video display suitable for interacting with a human user of apparatus 2300. Generally, video display 2340 is a computer monitor or other similar video display.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

1. A method of designing one or more sequences of short interfering RNAs that can interact with one or more sites in a given transcript of a given sequence in a given organism and result in the down-regulation of the expression of a protein product encoded by the given transcript, the method comprising the steps of: identifying one or more regions of interest in the sequence of the given transcript; sub-selecting one or more regions from the collection of the regions of interest; generating one or more derived sequences from the sequence of the one or more sub-selected regions; using the one or more derived sequences to create one or more instances of a corresponding molecule that the one or more derived sequences represent; and using the one or more instances of the created molecule in an appropriate environment to regulate the expression of the given transcript.
2. The method of claim 1, wherein a region of interest in the collection of regions of interest is identified to be an instance of a motif that has one or more copies in intergenic and intronic regions of the genome of interest and one or more copies in untranslated and amino acid coding regions of one or more genes in the genome of interest.
3. The method of claim 1, wherein a region of interest in the collection of regions of interest is identified using a method that is based on pattern discovery.
4. The method of claim 1, wherein a region of interest in the collection of regions of interest is identified to be a target island.
5. The method of claim 1, wherein a region of interest is located in a 5′ untranslated region of the given transcript.
6. The method of claim 1, wherein a region of interest is located in an amino acid coding region of the given transcript.
7. The method of claim 6, wherein a region of interest is located in a 3′ untranslated region of the given transcript.
8. The method of claim 1, wherein the genome of interest is a eukaryotic genome.
9. The method of claim 8, wherein the eukaryotic genome is a human genome.
10. The method of claim 8, wherein the eukaryotic genome is a mouse genome.
11. The method of claim 8, wherein the eukaryotic genome is a rat genome.
12. The method of claim 8, wherein the eukaryotic genome is a dog genome.
13. The method of claim 8, wherein the eukaryotic genome is a fruit fly genome.
14. The method of claim 8, wherein the eukaryotic genome is a worm genome.
15. The method of claim 1, wherein a region of interest is sub-selected based on one or more of its attributes.
16. The method of claim 15, wherein an attribute is length of the region.
17. The method of claim 15, wherein an attribute is location of the region in the transcript.
18. The method of claim 1, wherein a derived sequence is a reverse complement of the sequence of the one or more sub-selected regions.
19. The method of claim 1, wherein a derived sequence is a near-reverse complement of the sequence of the one or more sub-selected regions.
20. The method of claim 1, wherein the one or more copies of the molecule can be built using any of a set of biochemical processes.